Abstract

Solomonoff sequence prediction is a scheme to predict digits of binary strings 
without knowing the underlying probability distribution. We call a prediction 
scheme informed when it knows the true probability distribution of the sequence. 
Several new relations between universal Solomonoff sequence prediction and in- 
formed prediction and general probabilistic prediction schemes will be proved. 
Among others, they show that the number of errors in Solomonoff prediction is 
finite for computable distributions, if finite in the informed case. Deterministic 
variants will also be studied. The most interesting result is that the deterministic 
variant of Solomonoff prediction is optimal compared to any other probabilistic or 
deterministic prediction scheme apart from additive square root corrections only. 
This makes it well suited even for difficult prediction problems, where it does not 
suffice when the number of errors is minimal to within some factor greater than one. 
Solomonoff's original bound and the ones presented here complement each other in
a useful way.
1 Introduction

Induction is the process of predicting the future from the past or, more precisely, it is
the process of finding rules in (past) data and using these rules to guess future data.
The induction principle has been subject to long philosophical controversies. Highlights
are Epicurus' principle of multiple explanations, Occam's razor (simplicity) principle and
Bayes' rule for conditional probabilities [?]. In 1964, Solomonoff [?] elegantly unified all
these aspects into one formal theory of inductive inference. The theory allows the
prediction of digits of binary sequences without knowing their true probability distribution,
in contrast to what we call an informed scheme, where the true distribution is known. A
first error estimate was also given by Solomonoff 14 years later in [?]. It states that the
total mean squared distance between the prediction probabilities of Solomonoff and informed
prediction is bounded by the Kolmogorov complexity of the true distribution. As a
corollary, this theorem ensures that Solomonoff prediction converges to informed prediction for
computable sequences in the limit. This is the key result justifying the use of Solomonoff
prediction for long sequences of low complexity.

Another natural question is to ask for relations between the total number of expected
errors $E_\xi$ in Solomonoff prediction and the total number of prediction errors $E_\mu$ in the
informed scheme. Unfortunately [?] does not bound $E_\xi$ in terms of $E_\mu$ in a satisfactory
way. For example it does not exclude the possibility of an infinite $E_\xi$ even if $E_\mu$ is finite.
Here we want to prove upper bounds to $E_\xi$ in terms of $E_\mu$, ensuring as a corollary that
the above case cannot happen. On the other hand, our theorem does not say much about
the convergence of Solomonoff to informed prediction. So Solomonoff's and our bounds
complement each other in a nice way.

In the preliminary Section 2 we give some notations for strings and conditional probability
distributions on strings. Furthermore, we introduce Kolmogorov complexity and the 
universal probability, where we take care to make the latter a true probability measure. 

In Section 3 we define the general probabilistic prediction scheme ($\rho$) and Solomonoff ($\xi$)
and informed ($\mu$) prediction as special cases. We will give several error relations between
these prediction schemes. A bound for the error difference $|E_\xi - E_\mu|$ between Solomonoff
and informed prediction is the central result. All other relations are then simple but
interesting consequences, or known results such as the Euclidean bound.

In Section 4 we study deterministic variants of Solomonoff ($\Theta_\xi$) and informed ($\Theta_\mu$)
prediction. We will give similar error relations as in the probabilistic case between these
prediction schemes. The most interesting consequence is that the $\Theta_\xi$ system is
optimal compared to any other probabilistic or deterministic prediction scheme apart from
additive square root corrections only.

In the Appendices A, B and C we prove the inequalities (18), (20) and (26), which are
the central parts of the proofs of Theorems 1 and 2.



For an excellent introduction to Kolmogorov complexity and Solomonoff induction one
should consult the book of Li and Vitányi [7] or the article [?] for a short course. Historical
surveys of inductive reasoning/inference can be found in [?], [?].
2 Preliminaries

Throughout the paper we will consider binary sequences/strings and conditional
probability measures on strings.

We will denote strings over the binary alphabet $\{0,1\}$ by $s = x_1 x_2 \ldots x_n$ with $x_k \in \{0,1\}$
and their lengths with $l(s) = n$. $\epsilon$ is the empty string, $x_{n:m} := x_n x_{n+1} \ldots x_{m-1} x_m$ for $n \le m$
and $\epsilon$ for $n > m$. Furthermore, $x_{<n} := x_1 \ldots x_{n-1}$.

We use Greek letters for probability measures and underline their arguments to indicate
that they are probability arguments. Let $\rho_n(\underline{x_1 \ldots x_n})$ be the probability that an (infinite)
sequence starts with $x_1 \ldots x_n$. We drop the index on $\rho$ if it is clear from its arguments:

$$\sum_{x_n \in \{0,1\}} \rho(\underline{x_{1:n}}) \equiv \sum_{x_n} \rho_n(\underline{x_{1:n}}) = \rho_{n-1}(\underline{x_{<n}}) \equiv \rho(\underline{x_{<n}}), \qquad \rho(\epsilon) \equiv \rho_0(\epsilon) = 1. \tag{1}$$

We also need conditional probabilities derived from Bayes' rule. We prefer a notation
which preserves the order of the words, in contrast to the standard notation $\rho(\cdot|\cdot)$ which
flips it. We extend the definition of $\rho$ to the conditional case with the following convention
for its arguments: an underlined argument $\underline{x_k}$ is a probability variable and other
non-underlined arguments $x_k$ represent conditions. With this convention, Bayes' rule has the
following look:

$$\rho(x_{<n}\underline{x_n}) = \rho(\underline{x_{1:n}}) / \rho(\underline{x_{<n}}) \quad\text{and}\quad \rho(\underline{x_1 \ldots x_n}) = \rho(\underline{x_1}) \cdot \rho(x_1\underline{x_2}) \cdot \ldots \cdot \rho(x_1 \ldots x_{n-1}\underline{x_n}). \tag{2}$$

The first equation states that the probability that a string $x_1 \ldots x_{n-1}$ is followed by $x_n$ is
equal to the probability that a string starts with $x_1 \ldots x_n$ divided by the probability that
a string starts with $x_1 \ldots x_{n-1}$. The second equation is the first, applied $n$ times.
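As a small illustration of the chain rule in (2), the sketch below multiplies conditional next-bit probabilities into the probability of a whole prefix. The predictor `laplace` and the helper `chain_rule` are hypothetical names chosen for this example (Laplace's rule of succession is just one convenient conditional); neither appears in the text.

```python
def laplace(prefix):
    """Hypothetical conditional rho(x_<k 1): Laplace's rule of succession."""
    return (sum(prefix) + 1) / (len(prefix) + 2)

def chain_rule(cond_one, x):
    """rho(x_1...x_n) as a product of conditionals, second equation of (2)."""
    prob = 1.0
    for k, bit in enumerate(x):
        p_one = cond_one(x[:k])            # probability that bit k+1 is 1
        prob *= p_one if bit == 1 else 1.0 - p_one
    return prob

x = (1, 1, 0)
p_x = chain_rule(laplace, x)               # (1/2) * (2/3) * (1/4)
# Consistency condition (1): summing over the last bit gives the prefix probability.
p_prefix = chain_rule(laplace, x[:-1])
p_sum = chain_rule(laplace, x[:-1] + (0,)) + chain_rule(laplace, x[:-1] + (1,))
```

Any function returning a number in (0, 1) for each prefix can be passed as `cond_one`; the consistency check then holds by construction.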

Let us choose some universal monotone Turing machine $U$ with unidirectional input and
output tapes and a bidirectional work tape. We can then define the prefix Kolmogorov
complexity [?], [5] as the length of the shortest program $p$ for which $U$ outputs string $s$:

$$K(s) := \min_p \{ l(p) : U(p) = s \}. \tag{3}$$

The universal semi-measure $M(s)$ is defined as the probability that the output of the
universal Turing machine $U$ starts with $s$ when provided with fair coin flips on the input
tape. It is easy to see that this is equivalent to the formal definition

$$M(s) := \sum_{p \,:\, \exists\omega : U(p) = s\omega} 2^{-l(p)} \tag{4}$$

where the sum is over minimal programs $p$ for which $U$ outputs a string starting with
$s$. $U$ might be non-terminating. $M$ has the important universality property [?] that it
majorizes every computable probability measure $\rho$ up to a multiplicative factor depending
only on $\rho$ but not on $s$:

$$\rho(s) \le 2^{K(\rho)+O(1)} M(s). \tag{5}$$
The Kolmogorov complexity of a function like $\rho$ is defined as the length of the shortest
self-delimiting coding of a Turing machine computing this function. Unfortunately $M$
itself is not a probability measure on the binary strings. We have $M(s0) + M(s1) < M(s)$
because there are programs $p$ which output just $s$, followed neither by 0 nor by 1; they
just stop after printing $s$ or continue forever without any further output. This drawback
can easily be corrected¹ [?]. Let us define the universal probability measure $\xi$ by defining
first the conditional probabilities

$$\xi(x_{<n}\underline{x_n}) := \frac{M(x_{1:n})}{M(x_{<n}0) + M(x_{<n}1)} \tag{6}$$

and then by using (2) to get $\xi(\underline{x_1 \ldots x_n})$. It is easily verified by induction that $\xi$ is indeed
a probability measure and universal:

$$\rho(s) \le 2^{K(\rho)+O(1)} \xi(s). \tag{7}$$

The latter follows from $\xi(s) \ge M(s)$ and (5). The universality property (7) is all we need
to know about $\xi$ in the following.



3 Probabilistic Sequence Prediction 

Every inductive inference problem can be brought into the following form: given a string
$x$, give a guess for its continuation $y$. We will assume that the strings which have to be
continued are drawn according to a probability distribution². In this section we consider
probabilistic predictors of the next bit of a string. So let $\mu(\underline{x_1 \ldots x_n})$ be the true probability
measure of string $x_{1:n}$, $x_k \in \{0,1\}$, and $\rho(x_{<n}\underline{x_n})$ be the probability that the system predicts
$x_n$ as the successor of $x_1 \ldots x_{n-1}$. We are not interested here in the probability of the next
bit itself. We want our system to output either 0 or 1. Probabilistic strategies are useful
in game theory, where they are called mixed strategies. We keep $\mu$ fixed and compare
different $\rho$. An interesting quantity is the probability of making an error when predicting
$x_n$, given $x_{<n}$. If $x_n = 0$, the probability of our system to predict 1 (making an error) is
$\rho(x_{<n}\underline{1}) = 1 - \rho(x_{<n}\underline{0})$. That $x_n$ is 0 happens with probability $\mu(x_{<n}\underline{0})$. Analogously for
$0 \leftrightarrow 1$. So the probability of making a wrong prediction in the $n$th step ($x_{<n}$ fixed) is

$$e_{n\rho}(x_{<n}) := \sum_{x_n \in \{0,1\}} \mu(x_{<n}\underline{x_n}) \, [1 - \rho(x_{<n}\underline{x_n})]. \tag{8}$$

The total $\mu$-expected number of errors in the first $n$ predictions is

$$E_{n\rho} := \sum_{k=1}^{n} \sum_{x_1 \ldots x_{k-1}} \mu(\underline{x_{<k}}) \cdot e_{k\rho}(x_{<k}). \tag{9}$$
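To make definitions (8) and (9) concrete, here is a minimal brute-force sketch that enumerates all prefixes. The function name `expected_errors`, the i.i.d. source with bias 1/3, and the fair-coin predictor are assumptions made for illustration only; the enumeration is exponential in n and meant for tiny examples.

```python
from itertools import product

def expected_errors(mu, rho, n):
    """E_{n rho}: mu-expected number of errors in the first n predictions, eq. (9).

    mu(prefix) and rho(prefix) return the conditional probability that the
    next bit is 1, given the tuple `prefix` of earlier bits.
    """
    total = 0.0
    for k in range(1, n + 1):
        for prefix in product((0, 1), repeat=k - 1):
            # mu-probability of the prefix via the chain rule (2)
            weight = 1.0
            for i, bit in enumerate(prefix):
                p_one = mu(prefix[:i])
                weight *= p_one if bit == 1 else 1.0 - p_one
            y, r = mu(prefix), rho(prefix)
            # error probability (8): bit is 1 but 0 is predicted, or vice versa
            total += weight * (y * (1.0 - r) + (1.0 - y) * r)
    return total

# Hypothetical i.i.d. source: each bit is 1 with probability 1/3.
mu = lambda prefix: 1.0 / 3.0
informed = expected_errors(mu, mu, 5)                  # rho = mu
fair_coin = expected_errors(mu, lambda p: 0.5, 5)      # rho ignores the data
```

For this source the informed scheme errs with probability 2(1/3)(2/3) = 4/9 per step, while a predictor that always guesses with probability 1/2 errs with probability 1/2.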

¹ Another popular way is to keep $M$ and sacrifice some of the axioms of probability theory. The reason
for doing this is that $M$, although not computable [?], is at least enumerable. On the other hand, we
are interested in conditional probabilities derived from $M$, which are no longer enumerable anyway, so
there is no reason for us to stick to $M$. $\xi$ is still computable in the limit, or approximable.

² This probability measure $\mu$ might be 1 for some sequence $x_{1:\infty}$ and 0 for all others. In this case,
$K(\mu_n)$ is equal to $K(x_{1:n})$ (up to terms of order 1).
If $\mu$ is known, a natural choice for $\rho$ is $\rho = \mu$. This is what we call an informed prediction
scheme. If the probability of $x_n$ is high (low), the system predicts $x_n$ with high (low)
probability. If $\mu$ is unknown, one could try the universal distribution $\xi$ for $\rho$ as defined in
(4) and (6). This is known as Solomonoff prediction [?].

What we are most interested in is an upper bound for the $\mu$-expected number of errors
$E_{n\xi}$ of the $\xi$-predictor. One might also be interested in the probability difference of
predictions at step $n$ of the $\mu$- and $\xi$-predictor, or the total absolute difference to some
power $\alpha$ ($\alpha$-norm in $n$-space):

$$d_k(x_{<k}) := \sum_{x_k} \mu(x_{<k}\underline{x_k}) \cdot |\xi(x_{<k}\underline{x_k}) - \mu(x_{<k}\underline{x_k})| = |\xi(x_{<k}\underline{0}) - \mu(x_{<k}\underline{0})|,$$

$$\Delta_n^{(\alpha)} := \sum_{k=1}^{n} \sum_{x_{<k}} \mu(\underline{x_{<k}}) \cdot d_k^{\alpha}(x_{<k}), \qquad \alpha = 1, 2. \tag{10}$$

For $\alpha = 2$ there is the well-known result [?]

$$\Delta_n^{(2)} < \tfrac{1}{2} \ln 2 \cdot K(\mu) < \infty \quad \text{for computable } \mu. \tag{11}$$

One reason to directly study relations between $E_{n\xi}$ and $E_{n\mu}$ is that from (11) alone it
does not follow that $E_{\infty\xi}$ is finite if $E_{\infty\mu}$ is finite. Assume that we could choose $\mu$
such that $e_{n\mu} \sim 1/n^2$ and $e_{n\xi} \sim 1/n$. Then $E_{\infty\mu}$ would be finite, but $E_{\infty\xi}$ would be
infinite, without violating (11). There are other theorems, the most prominent being
$\xi(x_{<n}\underline{x_n}) / \mu(x_{<n}\underline{x_n}) \to 1$ for $n \to \infty$ with $\mu$ probability 1 (see page 332). However, neither of
them settles the above question. In the following we will show that a finite $E_{\infty\mu}$ causes a
finite $E_{\infty\xi}$.

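Let us define the Kullback-Leibler distance [4] or relative entropy between $\mu$ and $\xi$:

$$h_n(x_{<n}) := \sum_{x_n} \mu(x_{<n}\underline{x_n}) \ln \frac{\mu(x_{<n}\underline{x_n})}{\xi(x_{<n}\underline{x_n})}. \tag{12}$$

$H_n$ is then defined as the sum-expectation, for which the following can be shown [?]:

$$H_n := \sum_{k=1}^{n} \sum_{x_{<k}} \mu(\underline{x_{<k}}) \cdot h_k(x_{<k}) = \sum_{k=1}^{n} \sum_{x_{1:k}} \mu(\underline{x_{1:k}}) \ln \frac{\mu(x_{<k}\underline{x_k})}{\xi(x_{<k}\underline{x_k})} =$$

$$= \sum_{x_{1:n}} \mu(\underline{x_{1:n}}) \ln \prod_{k=1}^{n} \frac{\mu(x_{<k}\underline{x_k})}{\xi(x_{<k}\underline{x_k})} = \sum_{x_{1:n}} \mu(\underline{x_{1:n}}) \ln \frac{\mu(\underline{x_{1:n}})}{\xi(\underline{x_{1:n}})} < \ln 2 \cdot K(\mu_n) + O(1). \tag{13}$$

In the first line we have inserted (12) and used Bayes' rule $\mu(\underline{x_{<k}}) \cdot \mu(x_{<k}\underline{x_k}) = \mu(\underline{x_{1:k}})$. Due
to (1) we can replace $\sum_{x_{1:k}} \mu(\underline{x_{1:k}})$ by $\sum_{x_{1:n}} \mu(\underline{x_{1:n}})$, as the argument of the logarithm is
independent of $x_{k+1:n}$. The $k$ sum can now be exchanged with the $x_{1:n}$ sum and transforms
to a product inside the logarithm. In the last equality we have used the second form
of Bayes' rule (2) for $\mu$ and $\xi$. If we use universality (7) of $\xi$, i.e. $\ln \mu(\underline{x_{1:n}})/\xi(\underline{x_{1:n}}) <
\ln 2 \cdot K(\mu_n) + O(1)$, the final inequality in (13) follows, which is the basis of all error
estimates.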
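The chain of equalities in (13) can be checked numerically on a toy case. Since the universal $\xi$ is not computable, the sketch below substitutes a finite Bayes mixture over three i.i.d. models, which is an assumption of this example and not the construction of the text; the prior weight 1/3 then plays the role of $2^{-K(\mu)}$, so the analogue of the final bound is $H_n \le \ln 3$. All names (`THETAS`, `joint`, `xi`, ...) are hypothetical.

```python
from itertools import product
from math import log

THETAS = (0.2, 0.5, 0.8)      # hypothetical model class: i.i.d. with P(bit=1)=theta
WEIGHTS = (1/3, 1/3, 1/3)     # prior weights, standing in for 2^{-K(mu)}
TRUE = 0.2                    # the true mu is the first class member

def joint(theta, x):
    """Probability that an i.i.d.(theta) sequence starts with the bits x."""
    ones = sum(x)
    return theta ** ones * (1 - theta) ** (len(x) - ones)

def xi(x):
    """Mixture measure xi(x) = sum_i w_i * mu_i(x), a stand-in for the universal xi."""
    return sum(w * joint(t, x) for w, t in zip(WEIGHTS, THETAS))

def H_stepwise(n):
    """Stepwise form in (13): per-step log-ratios of conditionals, summed over k."""
    total = 0.0
    for k in range(1, n + 1):
        for prefix in product((0, 1), repeat=k - 1):
            for bit in (0, 1):
                x = prefix + (bit,)
                mu_cond = joint(TRUE, x) / joint(TRUE, prefix)
                xi_cond = xi(x) / xi(prefix)
                total += joint(TRUE, x) * log(mu_cond / xi_cond)
    return total

def H_joint(n):
    """Last form in (13): relative entropy of the joint distributions."""
    return sum(joint(TRUE, x) * log(joint(TRUE, x) / xi(x))
               for x in product((0, 1), repeat=n))
```

Both functions should agree up to rounding, and the value stays below $\ln 3$ because the mixture dominates the true model by its prior weight.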



We now come to our first theorem: 
Theorem 1. Let there be binary sequences $x_1 x_2 \ldots$ drawn with probability $\mu_n(\underline{x_{1:n}})$ for
the first $n$ bits. A $\rho$-system predicts by definition $x_n$ from $x_{<n}$ with probability $\rho(x_{<n}\underline{x_n})$.
$e_{n\rho}(x_{<n})$ is the error probability in the $n$th prediction (8) and $E_{n\rho}$ is the $\mu$-expected total
number of errors in the first $n$ predictions (9). The following error relations hold between
universal Solomonoff ($\rho = \xi$), informed ($\rho = \mu$) and general ($\rho$) predictions:

i) $|E_{n\xi} - E_{n\mu}| \le \Delta_n^{(1)} < H_n + \sqrt{2 E_{n\mu} H_n}$

ii) $\Delta_n^{(2)} < \tfrac{1}{2} H_n$

iii) $E_{n\xi} > \Delta_n^{(2)} + \tfrac{1}{2} E_{n\mu}$

iv) $E_{n\xi} > E_{n\mu} + H_n - \sqrt{2 E_{n\mu} H_n} > H_n$ for $E_{n\mu} > 2 H_n$

v) $E_{n\mu} \le 2 E_{n\rho}$, $e_{n\mu} \le 2 e_{n\rho}$ for any $\rho$

vi) $E_{n\xi} < 2 E_{n\rho} + H_n + \sqrt{4 E_{n\rho} H_n}$ for any $\rho$,

where $H_n < \ln 2 \cdot K(\mu) + O(1)$ is the relative entropy (13) and $K(\mu)$ is the Kolmogorov
complexity of $\mu$ (3).



Corollary 1. For computable $\mu$, i.e. for $K(\mu) < \infty$, the following statements
immediately follow from Theorem 1:

vii) if $E_{\infty\mu}$ is finite, then $E_{\infty\xi}$ is finite

viii) $E_{n\xi}/E_{n\mu} = 1 + O(E_{n\mu}^{-1/2}) \to 1$ for $E_{n\mu} \to \infty$

ix) $E_{n\xi} - E_{n\mu} = O(\sqrt{E_{n\mu}})$

x) $E_{n\xi}/E_{n\rho} \le 2 + O(E_{n\rho}^{-1/2})$.

Relation (i) is the central new result. It is best illustrated for computable $\mu$ by the
corollary. Statements (vii), (viii) and (ix) follow directly from (i) and the finiteness of
$H_\infty$. Statement (x) follows from (vi).

First of all, (vii) ensures finiteness of the number of errors of Solomonoff prediction
if the informed prediction makes only a finite number of errors. This is especially the
case for deterministic $\mu$, as $E_{n\mu} = 0$ in this case³. Solomonoff prediction makes only
a finite number of errors on computable sequences. For more complicated probabilistic
environments, where even the ideal informed system makes an infinite number of errors,
(ix) ensures that the error excess of Solomonoff prediction is only of order $\sqrt{E_{n\mu}}$. This
ensures that the error densities $E_n/n$ of both systems converge to each other, but (ix)
actually says more than this. It ensures that the quotient converges to 1 and also gives
the speed of convergence (viii).

Relation (ii) is the well-known Euclidean bound [?]. It is the only upper bound in Theorem
1 which remains finite for $E_{n\mu/\rho} \to \infty$. It ensures convergence of the individual prediction
probabilities $\xi(x_{<n}\underline{x_n}) \to \mu(x_{<n}\underline{x_n})$. Relation (iii) shows that the $\xi$ system makes at least
half of the errors of the $\mu$ system. Relation (iv) improves the lower bounds of (i) and (iii).
Together with the upper bound in (i) it says that the excess of $\xi$ errors as compared to $\mu$
errors is given by $H_n$ apart from $O(\sqrt{E_{n\mu} H_n})$ corrections. The excess is neither smaller
nor larger. This result is plausible, since knowing $\mu$ means additional information, which
saves making some of the errors. The information content of $\mu$ (relative to $\xi$) is quantified
in terms of the relative entropy $H_n$.

³ We call a probability measure deterministic if it is 1 for exactly one sequence and 0 for all others.

Relation (v) states that no prediction scheme can have less than half of the errors of
the $\mu$ system, whatever we take for $\rho$. This ensures the optimality of $\mu$ apart from a
factor of 2. Combining this with (i) ensures optimality of Solomonoff prediction, apart
from a factor of 2 and additive (inverse) square root corrections (vi), (x). Note that
even when comparing $\xi$ with $\rho$, the computability of $\mu$ is what counts, whereas $\rho$ might
be any, even an uncomputable, probabilistic predictor. The optimality within a factor
of 2 might be sufficient for some applications, especially for finite $E_{\infty\mu}$ or if $E_{n\mu}/n \to 0$,
but is unacceptable for others. More about this in the next section, where we consider
deterministic prediction, where no factor 2 occurs.
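Relations (i) to (iii) of Theorem 1 can be sanity-checked on a toy problem. As an assumption of this sketch, the uncomputable $\xi$ is replaced by a finite Bayes mixture over three i.i.d. models; the per-step inequalities behind (i) to (iii) only require $0 < z < 1$, so they should carry over with $H_n$ taken relative to that mixture. All identifiers are hypothetical.

```python
from itertools import product
from math import log, sqrt

# Hypothetical setup: three i.i.d. models, uniform prior, true model theta = 0.2.
THETAS, WEIGHTS, TRUE = (0.2, 0.5, 0.8), (1/3, 1/3, 1/3), 0.2

def joint(theta, x):
    """Probability that an i.i.d.(theta) sequence starts with the bits x."""
    ones = sum(x)
    return theta ** ones * (1 - theta) ** (len(x) - ones)

def xi(x):
    """Finite Bayes mixture used in place of the universal measure."""
    return sum(w * joint(t, x) for w, t in zip(WEIGHTS, THETAS))

def theorem1_quantities(n):
    """Return E_{n mu}, E_{n xi}, Delta^(1)_n, Delta^(2)_n and H_n."""
    E_mu = E_xi = D1 = D2 = H = 0.0
    for k in range(1, n + 1):
        for prefix in product((0, 1), repeat=k - 1):
            w = joint(TRUE, prefix)                  # mu(x_<k)
            y = TRUE                                 # mu(x_<k 1)
            z = xi(prefix + (1,)) / xi(prefix)       # xi(x_<k 1)
            E_mu += w * 2 * y * (1 - y)
            E_xi += w * (y * (1 - z) + (1 - y) * z)
            D1 += w * abs(y - z)
            D2 += w * (y - z) ** 2
            H += w * (y * log(y / z) + (1 - y) * log((1 - y) / (1 - z)))
    return E_mu, E_xi, D1, D2, H

E_mu, E_xi, D1, D2, H = theorem1_quantities(10)
checks = {
    "i": abs(E_xi - E_mu) <= D1 <= H + sqrt(2 * E_mu * H),
    "ii": D2 <= H / 2,
    "iii": E_xi >= D2 + E_mu / 2,
}
```

A passing check on one toy source is of course no substitute for the proof; it only guards against sign or index slips when the formulas are transcribed.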

Proof of Theorem 1. The first inequality in (i) follows directly from the definition of $E_n$
and $\Delta_n$ and the triangle inequality. For the second inequality, let us start more modestly
and try to find constants $A$ and $B$ which satisfy the linear inequality

$$\Delta_n^{(1)} < A \cdot E_{n\mu} + B \cdot H_n. \tag{14}$$

If we could show

$$d_k(x_{<k}) < A \cdot e_{k\mu}(x_{<k}) + B \cdot h_k(x_{<k}) \tag{15}$$

for all $k \le n$ and all $x_{<k}$, (14) would follow immediately by summation and the definition
of $\Delta_n$, $E_n$ and $H_n$. With $k$, $x_{<k}$, $\mu$, $\xi$ fixed now, we abbreviate

$$\begin{aligned}
y &:= \mu(x_{<k}\underline{1}), & 1 - y &= \mu(x_{<k}\underline{0}), \\
z &:= \xi(x_{<k}\underline{1}), & 1 - z &= \xi(x_{<k}\underline{0}), \\
r &:= \rho(x_{<k}\underline{1}), & 1 - r &= \rho(x_{<k}\underline{0}).
\end{aligned} \tag{16}$$

The various error functions can then be expressed by $y$, $z$ and $r$:

$$\begin{aligned}
e_{k\mu} &= 2y(1-y) \\
e_{k\xi} &= y(1-z) + (1-y)z \\
e_{k\rho} &= y(1-r) + (1-y)r \\
d_k &= |y - z| \\
h_k &= y \ln\frac{y}{z} + (1-y) \ln\frac{1-y}{1-z}.
\end{aligned} \tag{17}$$

Inserting this into (15) we get

$$|y - z| < A \cdot 2y(1-y) + B \cdot \left[ y \ln\frac{y}{z} + (1-y) \ln\frac{1-y}{1-z} \right]. \tag{18}$$

In Appendix A we will show that this inequality is true for $B \ge \frac{1}{2A} + 1$, $A > 0$. Inequality
(14) therefore holds for any $A > 0$, provided we insert $B = \frac{1}{2A} + 1$. Thus we might minimize
the r.h.s. of (14) w.r.t. $A$. The minimum is at $A = \sqrt{H_n / (2 E_{n\mu})}$, leading to the upper bound

$$\Delta_n^{(1)} < H_n + \sqrt{2 E_{n\mu} H_n},$$

which completes the proof of (i).
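The minimization over $A$ can be spelled out as follows. This is only the calculus step behind the stated minimum, with $\varphi$ introduced here as shorthand for the r.h.s. of (14) after inserting $B = \frac{1}{2A} + 1$.

```latex
\begin{align*}
\varphi(A) &:= A\,E_{n\mu} + \Big(\tfrac{1}{2A} + 1\Big) H_n, \\
\varphi'(A) &= E_{n\mu} - \frac{H_n}{2A^2} = 0
  \quad\Longrightarrow\quad A^* = \sqrt{\frac{H_n}{2E_{n\mu}}}, \\
\varphi(A^*) &= \sqrt{\tfrac{1}{2} E_{n\mu} H_n} + \sqrt{\tfrac{1}{2} E_{n\mu} H_n} + H_n
  = H_n + \sqrt{2 E_{n\mu} H_n}.
\end{align*}
```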

Bound (ii) is well known [?]. It is already linear and is proved by showing $d_n^2 < \tfrac{1}{2} h_n$.
Inserting the abbreviations (17) we get

$$2(y - z)^2 < y \ln\frac{y}{z} + (1-y) \ln\frac{1-y}{1-z}. \tag{19}$$

This lower bound for the Kullback-Leibler distance is well known [?].

Relation (iii) does not involve $H_n$ at all and is elementary. It is reduced to $e_{n\xi} > d_n^2 + \tfrac{1}{2} e_{n\mu}$,
equivalent to $z(1-y) + y(1-z) > (y-z)^2 + y(1-y)$, equivalent to $z(1-z) > 0$, which
is obviously true.
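For readers who want the intermediate algebra of the reduction for (iii), the difference of the two sides can be expanded term by term using the abbreviations (17):

```latex
\begin{align*}
e_{n\xi} - d_n^2 - \tfrac12 e_{n\mu}
  &= \big[y(1-z) + (1-y)z\big] - (y-z)^2 - y(1-y) \\
  &= (y + z - 2yz) - (y^2 - 2yz + z^2) - (y - y^2) \\
  &= z - z^2 \;=\; z(1-z) \;>\; 0 .
\end{align*}
```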

The second inequality of (iv) is trivial and the first is proved similarly to (i). Again we
start with a linear inequality $-E_{n\xi} < (A-1) E_{n\mu} + (B-1) H_n$, which is further reduced
to $-e_{k\xi} < (A-1) e_{k\mu} + (B-1) h_k$. Inserting the abbreviations (17) we get

$$-y(1-z) - z(1-y) < (A-1) \cdot 2y(1-y) + (B-1) \left[ y \ln\frac{y}{z} + (1-y) \ln\frac{1-y}{1-z} \right]. \tag{20}$$

In Appendix B this inequality is shown to hold for $2AB \ge 1$, when $B > 1$. If we insert
$B = \frac{1}{2A}$ and minimize w.r.t. $A$, the minimum is again at $A = \sqrt{H_n / (2 E_{n\mu})}$, leading to the
upper bound $-E_{n\xi} < -E_{n\mu} - H_n + \sqrt{2 E_{n\mu} H_n}$, restricted to $E_{n\mu} > 2 H_n$, which completes
the proof of (iv).

Statement (v) is satisfied because $2y(1-y) \le 2[y(1-r) + (1-y)r]$. Statement (vi) is a
direct consequence of (i) and (v). This completes the proof of Theorem 1. □



4 Deterministic Sequence Prediction 

In the last section several relations were derived between the number of errors of the
universal $\xi$-system, the informed $\mu$-system and arbitrary $\rho$-systems. All of them were
probabilistic predictors in the sense that given $x_{<n}$ they output 0 or 1 with certain
probabilities. In this section, we are interested in systems whose output on input $x_{<n}$ is
deterministically 0 or 1. Again we can distinguish between the case where the true
distribution $\mu$ is known or unknown. In the probabilistic scheme we studied the $\mu$ and the
$\xi$ system. Given any probabilistic predictor $\rho$ it is easy to construct a deterministic
predictor $\Theta_\rho$ from it in the following way: if the probability of predicting 0 is larger than $\tfrac{1}{2}$,
the deterministic predictor always chooses 0. Analogously for $0 \leftrightarrow 1$. We define⁴

$$\Theta_\rho(x_{<n}\underline{x_n}) := \Theta(\rho(x_{<n}\underline{x_n}) - \tfrac{1}{2}) := \begin{cases} 0 & \text{for } \rho(x_{<n}\underline{x_n}) < \tfrac{1}{2} \\ 1 & \text{for } \rho(x_{<n}\underline{x_n}) > \tfrac{1}{2}. \end{cases}$$

⁴ All results will be independent of the choice for $\rho = \tfrac{1}{2}$, so one might choose 0 for definiteness.

MARCUS HUTTER, TECHNICAL REPORT, IDSIA-11-10

Note that every deterministic predictor can be written in the form $\Theta_\rho$ for some $\rho$ and that
although $\Theta_\rho(\underline{x_1 \ldots x_n})$, defined via Bayes' rule (2), takes only values in $\{0,1\}$, it may still
be interpreted as a probability measure. Deterministic prediction is just a special case of
probabilistic prediction. The two models $\Theta_\mu$ and $\Theta_\xi$ will be studied now.
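The thresholding construction can be written down in a few lines. The sketch below is illustrative only: the function names are hypothetical, and the tie at probability exactly 1/2 is resolved as 0, which is one admissible choice.

```python
def theta(rho_next_one):
    """Deterministic variant of a predictor: predict 1 iff rho(x_<n 1) > 1/2.

    Ties (probability exactly 1/2) are resolved as 0 here.
    """
    return 1 if rho_next_one > 0.5 else 0

def step_error_deterministic(y, r):
    """Error probability of Theta_rho when the true P(next bit = 1) is y
    and the underlying probabilistic predictor assigns probability r to 1."""
    return 1.0 - y if theta(r) == 1 else y

def step_error_probabilistic(y, r):
    """Error probability of the probabilistic rho-system itself, as in (8)."""
    return y * (1.0 - r) + (1.0 - y) * r
```

For example, with true probability y = 1/3 and r = y, the thresholded predictor errs with probability 1/3, whereas the probabilistic one errs with probability 4/9.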

Analogously to the last section we draw binary strings randomly with distribution $\mu$ and
define the probability that the $\Theta_\rho$ system makes an erroneous prediction in the $n$th step
and the total $\mu$-expected number of errors in the first $n$ predictions as

$$e_{n\Theta_\rho}(x_{<n}) := \sum_{x_n} \mu(x_{<n}\underline{x_n}) \, [1 - \Theta_\rho(x_{<n}\underline{x_n})], \qquad E_{n\Theta_\rho} := \sum_{k=1}^{n} \sum_{x_{<k}} \mu(\underline{x_{<k}}) \cdot e_{k\Theta_\rho}(x_{<k}). \tag{21}$$



The definitions (12) and (13) of $h_n$ and $H_n$ remain unchanged ($\xi$ is not replaced by $\Theta_\xi$).
The following relations will be derived:

Theorem 2. Let there be binary sequences $x_1 x_2 \ldots$ drawn with probability $\mu_n(\underline{x_{1:n}})$ for the first
$n$ bits. A $\rho$-system predicts by definition $x_n$ from $x_{<n}$ with probability $\rho(x_{<n}\underline{x_n})$. A
deterministic system $\Theta_\rho$ always predicts 1 if $\rho(x_{<n}\underline{1}) > \tfrac{1}{2}$ and 0 otherwise. If $e_{n\rho}(x_{<n})$ is
the error probability in the $n$th prediction and $E_{n\rho}$ the total $\mu$-expected number of errors in
the first $n$ predictions (9), the following relations hold:

i) $0 \le E_{n\Theta_\xi} - E_{n\Theta_\mu} = \sum_{k=1}^{n} \sum_{x_{<k}} \mu(\underline{x_{<k}}) \, |e_{k\Theta_\xi} - e_{k\Theta_\mu}| < H_n + \sqrt{4 E_{n\Theta_\mu} H_n + H_n^2}$

ii) $E_{n\Theta_\mu} \le E_{n\rho}$, $e_{n\Theta_\mu} \le e_{n\rho}$ for any $\rho$

iii) $E_{n\Theta_\xi} < E_{n\rho} + H_n + \sqrt{4 E_{n\rho} H_n + H_n^2}$ for any $\rho$,

where $H_n < \ln 2 \cdot K(\mu) + O(1)$ is the relative entropy (13), which is finite for computable
$\mu$.

No other useful bounds have been found, especially no bounds for the analogue of $\Delta_n$.



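Corollary 2. For computable $\mu$, i.e. for $K(\mu) < \infty$, the following statements
immediately follow from Theorem 2:

vii) if $E_{\infty\Theta_\mu}$ is finite, then $E_{\infty\Theta_\xi}$ is finite

viii) $E_{n\Theta_\xi}/E_{n\Theta_\mu} = 1 + O(E_{n\Theta_\mu}^{-1/2}) \to 1$ for $E_{n\Theta_\mu} \to \infty$

ix) $E_{n\Theta_\xi} - E_{n\Theta_\mu} = O(\sqrt{E_{n\Theta_\mu}})$

x) $E_{n\Theta_\xi}/E_{n\rho} \le 1 + O(E_{n\rho}^{-1/2})$.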



Most of what we said in the probabilistic case remains valid here, as the Theorems and 
Corollaries 1 and 2 parallel each other. For this reason we will only highlight the differ- 
ences. 

The last inequality of (i) is the central new result in the deterministic case. Again, it is 
illustrated in the corollary, which follows trivially from Theorem 2. 
From (ii) we see that $\Theta_\mu$ is the best prediction scheme possible, compared to any other
probabilistic or deterministic prediction $\rho$. The error expectation $e_{n\Theta_\mu}$ is smaller in every
single step and hence the total number of errors is smaller too. This itself is not surprising and
nearly obvious, as the $\Theta_\mu$ system always predicts the bit of highest probability. So, for
known $\mu$, the $\Theta_\mu$ system should always be preferred to any other prediction scheme, even
to the informed $\mu$ prediction system.

Combining (i) and (ii) leads to a bound (iii) on the number of prediction errors of the
deterministic variant of Solomonoff prediction. For computable $\mu$, no prediction scheme
can have fewer errors than the $\Theta_\xi$ system, whatever we take for $\rho$, apart from
some additive correction of order $\sqrt{E_{n\Theta_\mu}}$. No factor 2 occurs as in the probabilistic
case. Together with the quick convergence $E_{n\rho}^{-1/2} \to 0$ stated in (x), the $\Theta_\xi$ model should be
sufficiently good in many applications.
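Relation (i) of Theorem 2 can be sanity-checked on a toy source in the same spirit. The sketch assumes a finite Bayes mixture over three i.i.d. models in place of the uncomputable $\xi$ (an assumption of this example, with hypothetical identifiers throughout) and thresholds its predictions at 1/2.

```python
from itertools import product
from math import log, sqrt

# Hypothetical setup: three i.i.d. models, uniform prior, true model theta = 0.2.
THETAS, WEIGHTS, TRUE = (0.2, 0.5, 0.8), (1/3, 1/3, 1/3), 0.2

def joint(theta, x):
    """Probability that an i.i.d.(theta) sequence starts with the bits x."""
    ones = sum(x)
    return theta ** ones * (1 - theta) ** (len(x) - ones)

def xi(x):
    """Finite Bayes mixture used in place of the universal measure."""
    return sum(w * joint(t, x) for w, t in zip(WEIGHTS, THETAS))

def theorem2_quantities(n):
    """Return E_{n Theta_mu}, E_{n Theta_xi} and H_n."""
    E_tmu = E_txi = H = 0.0
    for k in range(1, n + 1):
        for prefix in product((0, 1), repeat=k - 1):
            w = joint(TRUE, prefix)                      # mu(x_<k)
            y = TRUE                                     # mu(x_<k 1)
            z = xi(prefix + (1,)) / xi(prefix)           # xi(x_<k 1)
            E_tmu += w * min(y, 1 - y)                   # error of Theta_mu
            E_txi += w * ((1 - y) if z > 0.5 else y)     # error of Theta_xi
            H += w * (y * log(y / z) + (1 - y) * log((1 - y) / (1 - z)))
    return E_tmu, E_txi, H

E_tmu, E_txi, H = theorem2_quantities(10)
excess = E_txi - E_tmu
bound = H + sqrt(4 * E_tmu * H + H * H)
```

The excess is non-negative because $\Theta_\mu$ minimizes the error in every step, and it should stay below `bound` whichever way ties at 1/2 are resolved.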

Example. Let us consider a critical example. We want to predict the outcome of a die
colored black (=0) and white (=1). Two faces should be white and the other 4 should be
black. The game becomes more interesting by having a second complementary die with
two black and four white sides. The dealer who throws the dice uses one or the other die
according to some deterministic rule. The stake $s$ is \$3 in every round; our return $r$ is \$5
for every correct prediction.

The coloring of the dice and the selection strategy of the dealer unambiguously determine
$\mu$. $\mu(x_{<n}\underline{0})$ is $\tfrac{2}{3}$ for die 1 or $\tfrac{1}{3}$ for die 2. If we use $\rho$ for prediction, we will have made $E_{n\rho}$
incorrect and $n - E_{n\rho}$ correct predictions in the first $n$ rounds. The expected profit will
be

$$P_{n\rho} := (n - E_{n\rho}) r - n s = (2n - 5 E_{n\rho}) \, \$. \tag{22}$$

The winning threshold $P_{n\rho} > 0$ is reached if $E_{n\rho}/n < 1 - s/r = \tfrac{2}{5}$.

If we knew $\mu$, we could use the best possible prediction scheme $\Theta_\mu$. The error (21) and
profit (22) expectations per round in this case are

$$e_{\Theta_\mu} := e_{n\Theta_\mu}(x_{<n}) = \tfrac{1}{3} = \frac{E_{n\Theta_\mu}}{n} < \tfrac{2}{5}, \qquad \frac{P_{n\Theta_\mu}}{n} = \tfrac{1}{3}\,\$ > 0, \tag{23}$$

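so we can make money from this game. If we predict according to the probabilistic $\mu$
prediction scheme (8) we would lose money in the long run:

$$e_{n\mu}(x_{<n}) = 2 \cdot \tfrac{1}{3} \cdot \tfrac{2}{3} = \tfrac{4}{9} = \frac{E_{n\mu}}{n} > \tfrac{2}{5}, \qquad \frac{P_{n\mu}}{n} = -\tfrac{2}{9}\,\$ < 0.$$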
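The per-round numbers of the dice game can be reproduced with exact rational arithmetic. The helper name `profit_per_round` and the constants are introduced here for illustration; the stake, return and probabilities are those of the example.

```python
from fractions import Fraction as F

STAKE, RETURN = F(3), F(5)          # s and r from the example

def profit_per_round(error_rate):
    """P_n/n = (1 - e)*r - s for a scheme with per-round error probability e."""
    return (1 - error_rate) * RETURN - STAKE

p = F(2, 3)                          # probability of the more likely colour
e_theta_mu = min(p, 1 - p)           # always bet on the likelier colour
e_mu = 2 * p * (1 - p)               # randomise the bet according to mu
threshold = 1 - STAKE / RETURN       # winning requires E/n below this value

profit_theta_mu = profit_per_round(e_theta_mu)
profit_mu = profit_per_round(e_mu)
```

The deterministic informed scheme sits below the threshold 2/5 and earns 1/3 of a dollar per round, while the probabilistic informed scheme sits above it and loses 2/9 of a dollar per round.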



In the more interesting case where we do not know $\mu$ we can use Solomonoff prediction $\xi$
or its deterministic variant $\Theta_\xi$. From (viii) of Corollaries 1 and 2 we know that

$$P_{n\xi}/P_{n\mu} = 1 + O(n^{-1/2}) = P_{n\Theta_\xi}/P_{n\Theta_\mu},$$

so asymptotically the $\xi$ system provides the same profit as the $\mu$ system and the $\Theta_\xi$ system
the same as the $\Theta_\mu$ system. Using the $\xi$ system is a losing strategy, while using the $\Theta_\xi$
system is a winning strategy. Let us estimate the number of rounds we have to play before
reaching the winning zone with the $\Theta_\xi$ system. $P_{n\Theta_\xi} > 0$ if $E_{n\Theta_\xi} < (1 - s/r) n$ if

$$E_{n\Theta_\mu} + H_n + \sqrt{4 E_{n\Theta_\mu} H_n + H_n^2} < (1 - s/r) \cdot n$$

by Theorem 2 (i). Solving w.r.t. $H_n$ we get

$$H_n < \frac{(1 - s/r - E_{n\Theta_\mu}/n)^2}{2 (1 - s/r + E_{n\Theta_\mu}/n)} \cdot n.$$

Using $H_n < \ln 2 \cdot K(\mu) + O(1)$ and (23) we expect to be in the winning zone for

$$n > \frac{2 (1 - s/r + e_{\Theta_\mu})}{(1 - s/r - e_{\Theta_\mu})^2} \cdot \ln 2 \cdot K(\mu) + O(1) = 330 \ln 2 \cdot K(\mu) + O(1).$$
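The constant 330 follows from plugging $s/r = 3/5$ and $e_{\Theta_\mu} = 1/3$ into the prefactor. The sketch below recomputes it exactly and, purely as an illustration, evaluates the bound for a hypothetical rule complexity of 20 bits; that value of $K(\mu)$ is an assumption of this example, not a figure from the text, and the $O(1)$ term is ignored.

```python
from fractions import Fraction as F
from math import log

def winning_zone_constant(s, r, e):
    """Factor in front of ln2*K(mu): 2(1 - s/r + e) / (1 - s/r - e)^2."""
    c = 1 - F(s) / F(r)
    return 2 * (c + e) / (c - e) ** 2

const = winning_zone_constant(3, 5, F(1, 3))     # exact rational value

# Hypothetical complexity of the dealer's rule, in bits (not from the text).
K_BITS = 20
rounds = float(const) * log(2) * K_BITS          # roughly 4.6 thousand rounds
```

With these numbers the bound lands in the range of a few thousand rounds, in line with the order of magnitude discussed in the example.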

If the die selection strategy reflected in $\mu$ is not too complicated, the $\Theta_\xi$ prediction system
reaches the winning zone after a few thousand rounds. The number of rounds is not
really small because the expected profit per round is one order of magnitude smaller than
the return. This leads to a constant of two orders of magnitude size in front of $K(\mu)$.
Stated otherwise, it is due to the large stochastic noise, which makes it difficult to extract
the signal, i.e. the structure of the rule $\mu$. Furthermore, this is only a bound for the
turnaround value of $n$. The true expected turnaround $n$ might be smaller.

However, for every game for which there exists a winning strategy $\rho$ with $P_{n\rho} \sim n$, $\Theta_\xi$ is
guaranteed to get into the winning zone for some $n \sim K(\mu)$, i.e. $P_{n\Theta_\xi} > 0$ for sufficiently
large $n$. This is not guaranteed for the $\xi$-system, due to the factor 2 in the bound (x) of
Corollary 1.

Proof of Theorem 2. The method of proof is the same as in the previous section, so we
will keep it short. With the abbreviations (16) we can write $e_{k\Theta_\xi}$ and $e_{k\Theta_\mu}$ in the forms

$$\begin{aligned}
e_{k\Theta_\xi} &= y(1 - \Theta(z - \tfrac{1}{2})) + (1-y)\,\Theta(z - \tfrac{1}{2}) = |y - \Theta(z - \tfrac{1}{2})| \\
e_{k\Theta_\mu} &= y(1 - \Theta(y - \tfrac{1}{2})) + (1-y)\,\Theta(y - \tfrac{1}{2}) = \min\{y, 1-y\}.
\end{aligned} \tag{24}$$

With these abbreviations, (ii) is equivalent to $\min\{y, 1-y\} \le y(1-r) + (1-y)r$, which is
true, because the minimum of two numbers is never larger than their weighted average.

The first inequality and equality of (i) follow directly from (ii). To prove the last
inequality, we start once again with a linear model

$$E_{n\Theta_\xi} < (A+1) E_{n\Theta_\mu} + (B+1) H_n. \tag{25}$$

Inserting the definition of $E_n$ and $H_n$, using (24) and omitting the sums, we have to find
$A$ and $B$ which satisfy

$$|y - \Theta(z - \tfrac{1}{2})| < (A+1) \min\{y, 1-y\} + (B+1) \left[ y \ln\frac{y}{z} + (1-y) \ln\frac{1-y}{1-z} \right]. \tag{26}$$

In Appendix C we will show that the inequality is satisfied for $B \ge \tfrac{1}{4}A + \tfrac{1}{A}$ and $A > 0$.
Inserting $B = \tfrac{1}{4}A + \tfrac{1}{A}$ into (25) and minimizing the r.h.s. w.r.t. $A$, we get the upper bound

$$E_{n\Theta_\xi} < E_{n\Theta_\mu} + H_n + \sqrt{4 E_{n\Theta_\mu} H_n + H_n^2} \quad\text{for}\quad A^2 = \frac{H_n}{E_{n\Theta_\mu} + \tfrac{1}{4} H_n}.$$

Statement (iii) is a direct consequence of (i) and (ii). This completes the proof of Theorem
2. □
5 Conclusions

We have proved several new error bounds for Solomonoff prediction in terms of informed
prediction and in terms of general prediction schemes. Theorem 1 and Corollary 1
summarize the results in the probabilistic case and Theorem 2 and Corollary 2 for the
deterministic case. We have shown that in the probabilistic case $E_{n\xi}$ is asymptotically bounded
by twice the number of errors of any other prediction scheme. In the deterministic variant
of Solomonoff prediction this factor 2 is absent. It is well suited, even for difficult
prediction problems, as the error probability $E_{n\Theta_\xi}/n$ converges rapidly to the minimal
possible error probability $E_{n\Theta_\mu}/n$.



Acknowledgments: I thank Ray Solomonoff and Jürgen Schmidhuber for proofreading
this work and for numerous discussions.



A Proof of Inequality (18)⁵

With the definition

$$f(y,z;A,B) := A \cdot 2y(1-y) + B \cdot \left[ y \ln\frac{y}{z} + (1-y) \ln\frac{1-y}{1-z} \right] - |y - z|$$

we have to show $f(y,z;A,B) > 0$ for $0 < y < 1$, $0 < z < 1$ and suitable $A$ and $B$. We do
this by showing that $f > 0$ at all extremal values, 'at' boundaries and at non-analytical
points. $f \to +\infty$ for $z \to 0/1$, if we choose $B > 0$. Moreover, at the non-analytic point
$z = y$ we have $f(y,y;A,B) = 2Ay(1-y) > 0$ for $A > 0$. The extremal condition $\partial f/\partial z = 0$
for $z \ne y$ (keeping $y$ fixed) leads to

$$y = y^* := z \cdot \left[ 1 - \frac{s}{B}(1-z) \right], \qquad s := \mathrm{sign}(z - y) = \pm 1.$$

Inserting $y^*$ into the definition of $f$ and omitting the positive term $B[\ldots]$, we get

$$f(y^*,z;A,B) > 2Ay^*(1-y^*) - |z - y^*| = \frac{1}{B^2}\, z(1-z) \cdot g(z;A,B),$$

$$g(z;A,B) := 2A(B - s(1-z))(B + sz) - B.$$

We have reduced the problem to showing $g \ge 0$. Since $s = \pm 1$, we have $g(z;A,B) \ge
2A(B - 1 + z)(B - z) - B$ for $B > 1$. The latter is quadratic in $z$ and symmetric in
$z \leftrightarrow 1 - z$ with a maximum at $\tfrac{1}{2}$. Thus it is sufficient to check the boundary values
$g(0;A,B) = g(1;A,B) = 2A(B-1)B - B$. They are non-negative for $2A(B-1) \ge 1$.
Putting everything together, we have proved that $f > 0$ for $B \ge \frac{1}{2A} + 1$ and $A > 0$. □
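A quick numerical cross-check of the result of this appendix evaluates $f$ on a grid of interior points with $B$ set to its smallest admissible value $\frac{1}{2A} + 1$. The grid resolution and the sampled values of $A$ are arbitrary choices of this sketch, and a grid scan is only a plausibility check, not a replacement for the argument above.

```python
from math import log

def f18(y, z, A, B):
    """f(y,z;A,B) from Appendix A: r.h.s. minus l.h.s. of inequality (18)."""
    kl = y * log(y / z) + (1 - y) * log((1 - y) / (1 - z))
    return A * 2 * y * (1 - y) + B * kl - abs(y - z)

grid = [i / 50 for i in range(1, 50)]           # interior points of (0, 1)
worst = min(f18(y, z, A, 1 / (2 * A) + 1)
            for A in (0.1, 0.5, 1.0, 3.0)
            for y in grid for z in grid)
```

A strictly positive `worst` is what the proof predicts for every interior point.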

⁵ The proofs are a bit sketchy. We will be a little sloppy about boundary values $y = 0/1$, $z = \tfrac{1}{2}$, $\Theta(0)$,
$\ge$ versus $>$, and approaching versus at the boundary. All subtleties have been checked and do not spoil
the results. As $0 < \xi < 1$, therefore $0 < z < 1$ is strict.
B Proof of Inequality (20)

The proof of this inequality is similar to the previous one. With the definition

$$f(y,z;A,B) := (A-1) \cdot 2y(1-y) + (B-1) \left[ y \ln\frac{y}{z} + (1-y) \ln\frac{1-y}{1-z} \right] + y(1-z) + z(1-y)$$

we have to show $f(y,z;A,B) > 0$ for $0 < y < 1$, $0 < z < 1$ and suitable $A$ and $B$. Again, we
do this by showing that $f > 0$ at all extremal values and 'at' the boundary. $f \to +\infty$ for
$z \to 0, 1$, if we choose $B > 1$. The extremal condition $\partial f/\partial z = 0$ (keeping $y$ fixed) leads
to

$$y = y^* := z \cdot \frac{z - B}{1 - B - 2z(1-z)}, \qquad 0 \le y^* \le 1.$$

Inserting $y^*$ into the definition of $f$ and omitting the positive term $(B-1)[\ldots]$, we get

$$f(y^*,z;A,B) > 2Ay^*(1-y^*) - (2y^* - 1)(z - y^*) = \frac{z(1-z)}{[1 - B - 2z(1-z)]^2} \cdot g(z;A,B),$$

$$g(z;A,B) := 2A(z - B)(1 - z - B) - (B-1)(2z-1)^2.$$

We have reduced the problem to showing $g \ge 0$. This is easy, since $g$ is quadratic in $z$ and
symmetric in $z \leftrightarrow 1 - z$. The extremal value $g(\tfrac{1}{2};A,B) = 2A(B - \tfrac{1}{2})^2$ is positive for $A > 0$.
The boundary values $g(0;A,B) = g(1;A,B) = (2AB - 1)(B - 1)$ are $\ge 0$ for $2AB \ge 1$.
Putting everything together, we have proved that $f > 0$ for $2AB \ge 1$ and $B > 1$. □
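The same kind of grid scan can be applied to inequality (20), here with $B = \frac{1}{2A}$ and values of $A$ below $\tfrac{1}{2}$ so that $B > 1$. The sampled values of $A$ and the grid are arbitrary choices of this sketch; it is a plausibility check only.

```python
from math import log

def f20(y, z, A, B):
    """f(y,z;A,B) from Appendix B: r.h.s. minus l.h.s. of inequality (20)."""
    kl = y * log(y / z) + (1 - y) * log((1 - y) / (1 - z))
    return ((A - 1) * 2 * y * (1 - y) + (B - 1) * kl
            + y * (1 - z) + z * (1 - y))

grid = [i / 50 for i in range(1, 50)]           # interior points of (0, 1)
worst = min(f20(y, z, A, 1 / (2 * A))
            for A in (0.1, 0.25, 0.4)           # A < 1/2 gives B = 1/(2A) > 1
            for y in grid for z in grid)
```

On the diagonal $y = z$ the function reduces to $2Ay(1-y)$, so small positive values near the corners of the grid are expected.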



C Proof of Inequality (26)

We want to show that

$$|y - \Theta(z - \tfrac{1}{2})| < (A+1) \min\{y, 1-y\} + (B+1) \left[ y \ln\frac{y}{z} + (1-y) \ln\frac{1-y}{1-z} \right].$$

The formula is symmetric w.r.t. $y \leftrightarrow 1-y$ and $z \leftrightarrow 1-z$ simultaneously, so we can restrict
ourselves to $0 < y < 1$ and $0 < z < \tfrac{1}{2}$. Furthermore, let $B > -1$. Using (19), it is enough to
prove

$$f(y,z;A,B) := (A+1) \min\{y, 1-y\} + (B+1) \cdot 2(y-z)^2 - y > 0.$$

$f$ is quadratic in $z$; thus for $y < \tfrac{1}{2}$ it takes its minimum at $z = y$. Since $f(y,y;A,B) =
Ay > 0$ for $A > 0$, we can concentrate on the case $y \ge \tfrac{1}{2}$. In this case, the minimum is
reached at the boundary $z = \tfrac{1}{2}$:

$$f(y,\tfrac{1}{2};A,B) = (A+1)(1-y) + (B+1) \cdot 2(y - \tfrac{1}{2})^2 - y.$$

This is now quadratic in $y$ with minimum at

$$y^* = \frac{A + 2B + 4}{4(B+1)}, \qquad f(y^*,\tfrac{1}{2};A,B) = \frac{4AB - A^2 - 4}{8(B+1)} \ge 0$$

for $B \ge \tfrac{1}{4}A + \tfrac{1}{A}$, $A > 0$ ($\Rightarrow B \ge 1$). □
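Inequality (26) itself, with the exact relative entropy rather than its quadratic lower bound (19), can again be scanned on a grid with $B = \tfrac{1}{4}A + \tfrac{1}{A}$. In this sketch the tie $\Theta(0)$ is taken as 0 and the sampled values of $A$ are arbitrary; it is a plausibility check only.

```python
from math import log

def f26(y, z, A, B):
    """r.h.s. minus l.h.s. of inequality (26); Theta(0) is taken as 0."""
    kl = y * log(y / z) + (1 - y) * log((1 - y) / (1 - z))
    prediction = 1 if z > 0.5 else 0
    return (A + 1) * min(y, 1 - y) + (B + 1) * kl - abs(y - prediction)

grid = [i / 50 for i in range(1, 50)]           # interior points of (0, 1)
worst = min(f26(y, z, A, A / 4 + 1 / A)
            for A in (0.5, 1.0, 2.0, 4.0)
            for y in grid for z in grid)
```

The margin is smallest near $z = \tfrac{1}{2}$ and $y$ close to the minimizer $y^*$ above, where the quadratic bound leaves no slack and only the gap between (19) and the exact relative entropy remains.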